<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
	<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">
	<link rel="stylesheet" href="/assets/css/atom-one-light.css">
    
        <title>Two sample Kolmogorov-Smirnov test</title>
		<link rel="stylesheet" type="text/css" href="/assets/css/002.css">
    
	<link rel="stylesheet" href="/assets/css/font-awesome.min.css">
	<link rel="shortcut icon" href="/assets/img/favicon.ico" type="image/x-icon">
	<link rel="icon" href="/assets/img/favicon.ico" type="image/x-icon">
	<script src="/assets/js/highlight.pack.js"></script>
	<script>hljs.initHighlightingOnLoad();</script>

	
		<script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    TeX: {
      equationNumbers: {
        autoNumber: "AMS"
      }
    },
    tex2jax: {
      inlineMath: [ ['$','$'] ],
      displayMath: [ ['$$','$$'] ],
      processEscapes: true,
    }
  });
</script>
<script type="text/javascript"
        src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>

	

	<script async src="https://www.googletagmanager.com/gtag/js?id=UA-140127665-1"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'UA-140127665-1');
</script>


</head>
<body>
	<div class="wrapper">
		<div class="default_title">
			<img src="/assets/img/mycomputer.png" />
			
				<h1>NaNg's blog</h1>
			
		</div>
		<ul class="topbar">
	<a href="/pages/me.html"><li><u>A</u>bout</li></a>
	<a href="/pages/links.html"><li><u>L</u>inks</li></a>
	<a href="https://www.dropbox.com/sh/nhy3m3cvojizkk5/AABH8mt5gh3PiBrtWnCxE87ma?dl=0"><li><u>D</u>ropbox</li></a>
</ul>
		<div class="tag_list">
			<ul id="tag-list">
				<li><a href="/" ><img src="/assets/img/disk.png" />(C:)</a>
			<ul>
				
				
				<li><a href="/tag/3d/" title="3d"><img src="/assets/img/folder.ico" />3d</a></li>
				
				<li><a href="/tag/bioinformatics/" title="bioinformatics"><img src="/assets/img/folder.ico" />bioinformatics</a></li>
				
				<li><a href="/tag/notes/" title="notes"><img src="/assets/img/folder.ico" />notes</a></li>
				
				<li><a href="/tag/others/" title="others"><img src="/assets/img/folder.ico" />others</a></li>
				
				<li><a href="/tag/sci-fiction/" title="sci fiction"><img src="/assets/img/folder.ico" />sci fiction</a></li>
				
			</ul>
				</li>
			</ul>
		</div>
		<div class="post_list">
			
				<ul>
					
					<li>
						<a href="https://ani-net-project.gitee.io/index.html" title="AniNet">
								<img class="small-icon" src="/assets/img/aninet.png" title="AniNet" />AniNet
						</a>
					</li>
					
					<li>
						<a href="/examples/boids/index.html" title="Boids">
								<img class="small-icon" src="/assets/img/bird.png" title="Boids" />Boids
						</a>
					</li>
					
					<li>
						<a href="/examples/EM-alg/1_how_it_works.html" title="EM 算法 1. how it works">
								<img class="small-icon" src="/assets/img/notebook.ico" title="EM 算法 1. how it works" />EM 算法 1. how it works
						</a>
					</li>
					
					<li>
						<a href="/examples/pyvm/pyvm_ch0-3_cn.html" title="翻译：Inside Python Virtual Machine（前三章）">
								<img class="small-icon" src="/assets/img/html_ie.ico" title="翻译：Inside Python Virtual Machine（前三章）" />翻译：Inside Python Virtual Machine（前三章）
						</a>
					</li>
					
					<li>
						<a href="/20190624/cLife.html" title="器官工业幻想">
								<img class="small-icon" src="/assets/img/file.ico" title="器官工业幻想" />器官工业幻想
						</a>
					</li>
					
					<li>
						<a href="/examples/pubnet/Network_statistic.html" title="PubNet network statistic">
								<img class="small-icon" src="/assets/img/notebook.ico" title="PubNet network statistic" />PubNet network statistic
						</a>
					</li>
					
					<li>
						<a href="/examples/pubnet/sample.html" title="PubMed bio-conception network example">
								<img class="small-icon" src="/assets/img/net.png" title="PubMed bio-conception network example" />PubMed bio-conception network example
						</a>
					</li>
					
					<li>
						<a href="/examples/hpo_enrich/example_sagd_00055.html" title="HPO enrichment example">
								<img class="small-icon" src="/assets/img/notebook.ico" title="HPO enrichment example" />HPO enrichment example
						</a>
					</li>
					
					<li>
						<a href="/20190222/KS-Test.html" title="Two sample Kolmogorov-Smirnov test">
								<img class="small-icon" src="/assets/img/file.ico" title="Two sample Kolmogorov-Smirnov test" />Two sample Kolmogorov-Smirnov test
						</a>
					</li>
					
					<li>
						<a href="/20190220/threejs-test-page.html" title="three.js test page">
								<img class="small-icon" src="/assets/img/tree.png" title="three.js test page" />three.js test page
						</a>
					</li>
					
					<li>
						<a href="/20181111/hic_data_format.html" title="Hi-C 数据分析结果应该怎么存？">
								<img class="small-icon" src="/assets/img/file.ico" title="Hi-C 数据分析结果应该怎么存？" />Hi-C 数据分析结果应该怎么存？
						</a>
					</li>
					
					<li>
						<a href="/20181010/d3_bubble_chart.html" title="用 D3.js 画一个 bubble chart">
								<img class="small-icon" src="/assets/img/file.ico" title="用 D3.js 画一个 bubble chart" />用 D3.js 画一个 bubble chart
						</a>
					</li>
					
					<li>
						<a href="/20180724/new_kind_slides.html" title="论制作 Slides 的几种姿势">
								<img class="small-icon" src="/assets/img/file.ico" title="论制作 Slides 的几种姿势" />论制作 Slides 的几种姿势
						</a>
					</li>
					
					<li>
						<a href="/slides/test/slideshow.html" title="Markdown Slides Test">
								<img class="small-icon" src="/assets/img/slides.png" title="Markdown Slides Test" />Markdown Slides Test
						</a>
					</li>
					
					<li>
						<a href="/20180221/bioview.html" title="bioView - 一个生信常用文件格式的可读性增强工具">
								<img class="small-icon" src="/assets/img/file.ico" title="bioView - 一个生信常用文件格式的可读性增强工具" />bioView - 一个生信常用文件格式的可读性增强工具
						</a>
					</li>
					
					<li>
						<a href="/20170916/markdown-test-page.html" title="Markdown Test Page">
								<img class="small-icon" src="/assets/img/file.ico" title="Markdown Test Page" />Markdown Test Page
						</a>
					</li>
					
					<li>
						<a href="/20170831/gol-js.html" title="一个JS实现的生命游戏">
								<img class="small-icon" src="/assets/img/file.ico" title="一个JS实现的生命游戏" />一个JS实现的生命游戏
						</a>
					</li>
					
					<li>
						<a href="/20170831/parallel.html" title="这大概是程序串行改并行最简单粗暴的方法">
								<img class="small-icon" src="/assets/img/file.ico" title="这大概是程序串行改并行最简单粗暴的方法" />这大概是程序串行改并行最简单粗暴的方法
						</a>
					</li>
					
					<li>
						<a href="/20170730/learn-docker.html" title="学习Docker">
								<img class="small-icon" src="/assets/img/file.ico" title="学习Docker" />学习Docker
						</a>
					</li>
					
					<li>
						<a href="/20170530/learn-assemble.html" title="学习汇编语言">
								<img class="small-icon" src="/assets/img/file.ico" title="学习汇编语言" />学习汇编语言
						</a>
					</li>
					
					<li>
						<a href="/20170505/schoolnet.html" title="如何在非校园网环境下使用学校文献数据库">
								<img class="small-icon" src="/assets/img/file.ico" title="如何在非校园网环境下使用学校文献数据库" />如何在非校园网环境下使用学校文献数据库
						</a>
					</li>
					
					<li>
						<a href="/20170308/zerotier.html" title="使用ZeroTier搭建虚拟局域网">
								<img class="small-icon" src="/assets/img/file.ico" title="使用ZeroTier搭建虚拟局域网" />使用ZeroTier搭建虚拟局域网
						</a>
					</li>
					
					<li>
						<a href="/20161224/hy-in-brief.html" title="Python生态下的Lisp方言">
								<img class="small-icon" src="/assets/img/file.ico" title="Python生态下的Lisp方言" />Python生态下的Lisp方言
						</a>
					</li>
					
					<li>
						<a href="/20161210/scrapy_douban.html" title="使用Scrapy爬取豆瓣相册">
								<img class="small-icon" src="/assets/img/file.ico" title="使用Scrapy爬取豆瓣相册" />使用Scrapy爬取豆瓣相册
						</a>
					</li>
					
					<li>
						<a href="/20161115/speed_up_python.html" title="加速Python">
								<img class="small-icon" src="/assets/img/file.ico" title="加速Python" />加速Python
						</a>
					</li>
					
				</ul>
			
		</div>
		<div class="post_total">
			
				<div class="left">25 object(s)</div>
			
			<div class="right">&nbsp;</div>
		</div>
	</div>
	
        <div class="content">
			<div class="post_title">
				<img src="/assets/img/file.png" />
				<h1>Two sample Kolmogorov-Smirnov test</h1>
				<a href="/"><div class="btn"><span class="fa fa-times"></span></div></a>
				<div class="btn btn_max"><span class="fa fa-window-maximize"></span></div>
				<div class="btn"><span class="fa fa-window-minimize"></span></div>
			</div>
			<ul class="topbar">
				<li>February 22, 2019</li>
			</ul>
			<div class="post_content" style="max-height: 600px">
				<div class="post_content_inner">
        		<p>The Kolmogorov-Smirnov test (KS-Test) is a non-parametric test
(it makes no assumption about the distribution the data come from)
for deciding whether two datasets differ significantly.
The null hypothesis is that the two samples are drawn
from the same distribution.</p>

<p>The key idea of the KS-Test is to build a statistic
(called the D statistic)
from the cumulative fraction functions
(empirical distribution functions) of the two datasets.
Following the definitions on the
<a href="https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test">KS-Test Wikipedia page</a>:</p>

<p>The empirical CDF (cumulative fraction function) $F_n$ of a sample $X_1, \dots, X_n$:</p>

<script type="math/tex; mode=display">F_n(x) = \frac{1}{n} \sum^{n}_{i=1} I_{(-\infty,x]}(X_i)</script>

<p>Here $I_{(-\infty,x]}$ is the indicator function: it
equals 1 if $X_i \leq x$ and 0 otherwise.
The D statistic is then:</p>

<script type="math/tex; mode=display">D_{n,m} = \sup_{x} | F_{1,n}(x) - F_{2,m}(x) |</script>

<p>Here $F_{1,n}$ and $F_{2,m}$ are the empirical CDFs
of the first and second dataset, which have sizes $n$ and $m$, and
$\sup_{x}$ denotes the supremum over all $x$.</p>
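
<p>As a rough illustration of the two definitions above, the sketch below computes the empirical CDF and the D statistic by hand and compares the result with SciPy. The helper names <code>ecdf</code> and <code>ks_statistic</code> and the random test data are made up for this example. It relies on the fact that both step functions only change at observed values, so checking the pooled data points is enough to find the supremum.</p>

```python
import numpy as np
from scipy.stats import ks_2samp


def ecdf(sample, x):
    """Fraction of observations in `sample` that are at most x (vectorized over x)."""
    sample = np.sort(np.asarray(sample))
    # searchsorted with side="right" counts how many sorted values are at most x
    return np.searchsorted(sample, x, side="right") / sample.size


def ks_statistic(a, b):
    """Largest absolute gap between the two empirical CDFs."""
    # Both ECDFs are step functions that only jump at observed values,
    # so it is enough to evaluate them on the pooled data points.
    grid = np.concatenate([a, b])
    return np.max(np.abs(ecdf(a, grid) - ecdf(b, grid)))


# Hypothetical test data, chosen only for this illustration
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
b = rng.normal(0, 1.5, 250)

d_manual = ks_statistic(a, b)
d_scipy = ks_2samp(a, b).statistic
print(d_manual, d_scipy)
```

<p>The two printed values should agree up to floating point error.</p>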

<p>Intuitively, the D statistic is the largest vertical distance between
the two cumulative curves, as shown in the following figure:</p>

<p><img src="https://upload.wikimedia.org/wikipedia/commons/3/3f/KS2_Example.png" alt="" width="40%" height="40%" /></p>

<p>The red line is the CDF of one dataset, the blue line is that of the other,
and the black arrow marks the maximum distance between them,
which is the D statistic.</p>

<p>Because D is built from the whole CDF rather than from a single summary such as the mean, the KS-Test is sensitive to
differences in both the location and the shape
of the CDFs of the two samples.
We can run some experiments to demonstrate this.</p>
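
<p>As a quick check of the location half of that claim, here is a minimal sketch in which the two samples share the same shape and differ only by a shift. The sample size, the shift of 0.5 and the seed are arbitrary choices for this illustration.</p>

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
base = rng.normal(0, 1, 1000)
shifted = rng.normal(0.5, 1, 1000)   # same shape, moved to the right

result = ks_2samp(base, shifted)
print(result.statistic, result.pvalue)
```

<p>With samples this large the shift should produce a very small p-value.</p>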

<h2 id="experiments">Experiments</h2>

<p>Here we use the two-sample KS-Test implementation in SciPy (<code>scipy.stats.ks_2samp</code>).</p>

<p>First, we generate two datasets from two
normal distributions that have the same $\mu$ but
different $\sigma$.</p>

<pre><code class="language-Python">&gt;&gt;&gt; import numpy as np
&gt;&gt;&gt; np.random.seed(42)
&gt;&gt;&gt; s1 = np.random.normal(0, 1, 200)
&gt;&gt;&gt; s2 = np.random.normal(0, 1.5, 250)
</code></pre>

<p>We can check their distributions with a histogram:</p>

<pre><code class="language-Python">&gt;&gt;&gt; import matplotlib.pyplot as plt
&gt;&gt;&gt; _, _, h1 = plt.hist(s1, 20, density=True, alpha=0.6)
&gt;&gt;&gt; _, _, h2 = plt.hist(s2, 20, density=True, alpha=0.6)
&gt;&gt;&gt; plt.legend([h1[0], h2[0]], ['sample1', 'sample2'])
&gt;&gt;&gt; plt.show()
</code></pre>

<p><img src="/images/blog/ks_test_1.png" alt="" width="50%" height="50%" /></p>

<p>Next, we run a t-test (both the pooled-variance version and Welch's version with <code>equal_var=False</code>) to compare the two samples:</p>

<pre><code class="language-Python">&gt;&gt;&gt; from scipy.stats import ttest_ind
&gt;&gt;&gt; ttest_ind(s1, s2)
Ttest_indResult(statistic=-1.0619794453886573, pvalue=0.2888171520114434)
&gt;&gt;&gt; ttest_ind(s1, s2, equal_var=False)
Ttest_indResult(statistic=-1.116168623094328, pvalue=0.2649835475758269)
</code></pre>

<p>Both p-values are well above 0.05,
so we cannot reject the null hypothesis of equal means at the 0.05 level.
The two datasets do differ (in their spread),
but the t-test only compares means and cannot see that difference.</p>

<p>The KS-Test gives a different answer:</p>

<pre><code class="language-Python">&gt;&gt;&gt; from scipy.stats import ks_2samp
&gt;&gt;&gt; ks_2samp(s1, s2)
Ks_2sampResult(statistic=0.17499999999999993, pvalue=0.0018698874195532526)
</code></pre>

<p>The small p-value (about 0.002) reveals the difference.</p>
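
<p>The same conclusion can be cross-checked against the large-sample rule given on the Wikipedia page cited above: reject at level $\alpha$ when D exceeds $c(\alpha)\sqrt{\frac{n+m}{nm}}$, with $c(\alpha)=\sqrt{-\frac{1}{2}\ln\frac{\alpha}{2}}$. The sketch below evaluates that threshold for the sample sizes used here. The function name <code>ks_critical_value</code> is made up for this example, and the rule is an approximation, so it will not exactly match the p-value SciPy reports.</p>

```python
import math


def ks_critical_value(n, m, alpha=0.05):
    """Approximate large-sample rejection threshold for the two-sample D statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


d_observed = 0.175                       # D reported by ks_2samp above (rounded)
threshold = ks_critical_value(200, 250)  # sizes of s1 and s2
print(d_observed, threshold)
```

<p>The threshold comes out at roughly 0.129, below the observed D of 0.175, which agrees with rejecting the null hypothesis at the 0.05 level.</p>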

<p>To be more rigorous, we repeat the random sampling and
calculate the p-values many times,
then compare the distributions of the p-values obtained
from the t-test and the KS-Test:</p>

<pre><code class="language-Python">import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import ttest_ind, ks_2samp

np.random.seed(42)
n = 200
mean1 = 1
mean2 = 1
sigma1 = 1
sigma2 = 1.5

p_ttest = []
p_kstest = []

# Draw fresh samples 300 times and record the p-value of each test
for _ in range(300):
    s1 = np.random.normal(mean1, sigma1, n)
    s2 = np.random.normal(mean2, sigma2, n)
    r_t = ttest_ind(s1, s2)
    r_ks = ks_2samp(s1, s2)
    p_ttest.append(r_t.pvalue)
    p_kstest.append(r_ks.pvalue)

fig, ax = plt.subplots(figsize=(6, 5))
ax.boxplot([p_ttest, p_kstest])
ax.set_xticklabels(['t-test', 'ks-test'])
# Dashed line marks the 0.05 significance level
ax.hlines(y=0.05, xmin=0, xmax=3, color='blue', linestyles='--', linewidth=0.5)
ax.set_ylabel("p-value")
plt.show()
</code></pre>

<p><img src="/images/blog/ks_test_2.png" alt="" /></p>

<p>From this we can see that the KS-Test is
a good tool for comparing two datasets
whose means are similar but whose distributions differ in other ways, such as spread.</p>
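
<p>One way to put a number on that comparison is to count how often each test rejects at the 0.05 level. This is a minimal sketch that reuses the simulation settings above. The random generator, the use of Welch's t-test and the variable names are choices made for this example, so the exact rates will differ from the boxplot run.</p>

```python
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

rng = np.random.default_rng(42)
n, repeats, alpha = 200, 300, 0.05

p_t, p_ks = [], []
for _ in range(repeats):
    a = rng.normal(1, 1, n)      # same mean ...
    b = rng.normal(1, 1.5, n)    # ... different spread
    p_t.append(ttest_ind(a, b, equal_var=False).pvalue)
    p_ks.append(ks_2samp(a, b).pvalue)

# Fraction of repetitions in which each test rejects at the chosen level
reject_t = np.mean(np.less(p_t, alpha))
reject_ks = np.mean(np.less(p_ks, alpha))
print(f"t-test rejects in {reject_t:.0%} of runs, KS-Test in {reject_ks:.0%}")
```

<p>Since the means are equal, the t-test should reject only at about the nominal 5% rate, while the KS-Test should reject noticeably more often.</p>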

				
					<br>
<hr>
<br>
<div class="donate">
	<p>Email: nanguage@yahoo.com</p>
</div>
				
				</div>
			</div>
		</div>
    
	<script src="/assets/js/001.js"></script>
	<script src="/assets/js/002.js"></script>
	<div class="footer">
		<p>blog theme: <a href="https://github.com/h01000110/windows-95">win95</a></p>
	</div>
</body>
</html>